wiki_dumps: strip the 86 page from public dumps #547
Merged
The public dump invites bots and AI scrapers to ingest the whole wiki (see index.json's note_to_bots), but dumpBackup.php exports the 86 page along with everything else. That undoes the robots.txt Disallow/Noindex on /wiki/86 -- the dump becomes the larger exposure vector. Add a post-export filter that drops excluded base titles plus their subpages (Base/...) and Talk pages (Talk:Base, Talk:Base/...) before publishing. Titles are configurable via wiki_dumps_exclude_titles (default: 86), kept in sync with roles/mediawiki/files/robots.txt. The EXCLUDE_TITLES env var is passed through the systemd service unit. If the filter fails the dump aborts (set -e) rather than publishing an unfiltered file, so latest.xml.gz keeps pointing at the last good dump.
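The matching rule above (base title, its subpages, and the corresponding Talk pages) can be sketched as a small predicate. This is an illustration only, not the code in this PR: the function name `is_excluded` is hypothetical, and it assumes exact, case-sensitive title comparison and that only the `Talk:` namespace needs handling.

```python
from collections.abc import Iterable


def is_excluded(title: str, bases: Iterable[str]) -> bool:
    """Return True if title is an excluded base, a subpage of it, or its Talk page.

    Hypothetical helper: assumes titles appear exactly as written in the dump
    and that "Talk:" is the only talk namespace that matters.
    """
    # Treat Talk:Base and Talk:Base/... the same as Base and Base/...
    subject = title[len("Talk:"):] if title.startswith("Talk:") else title
    # Require an exact match or a "/" boundary so that "868 HAYES" is not
    # caught by the base title "86".
    return any(subject == base or subject.startswith(base + "/") for base in bases)
```

Matching on `base + "/"` rather than on a bare prefix is what avoids false positives for unrelated titles that merely start with the same characters.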
Hmm, yea, without having some wiki database attributes that can be used to filter, hacks may be required. I did a quick look at the backup PHP code, I don't see any obvious exclude list features.
jetpham
approved these changes
Jun 24, 2026
SuperQ
approved these changes
Jun 25, 2026
Why
The daily public dump at https://dumps.noisebridge.net/ is still leaking this page. dumpBackup.php --public exports every publicly-readable page, and the 86 page has no read restriction on the wiki, so it lands in latest.xml.gz.
Worse, index.json actively invites bots and AI scrapers to ingest the dump ("Please use these dumps instead of hitting the live site").
What
files/dump_filter.py: streams the gzipped dump and drops any <page> whose title is an excluded base, a subpage (86/…), or a Talk page (Talk:86, Talk:86/…), then rewrites a clean gzip (namespaces preserved, no ns0: prefixes).
files/wiki_dump.sh: exports to .raw.gz, filters into the final file. If the filter fails the script aborts (set -e) rather than publishing an unfiltered dump, so latest.xml.gz keeps pointing at the last good dump.
tasks/main.yml: deploys the filter and passes EXCLUDE_TITLES into the cron job.
defaults/main.yml: wiki_dumps_exclude_titles: ["86"], with a comment to keep it in sync with roles/mediawiki/files/robots.txt.
Testing
Ran the filter against a synthetic export: it stripped 86, 86/2023, and Talk:86 while correctly keeping Main Page and 868 HAYES (no false-prefix match). Output XML stayed byte-clean.
Notes
/usr/local/sbin/wiki_dump by hand on the host after the playbook runs.
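For readers who want a feel for the streaming step described under What, here is one possible approach. It is a sketch under stated assumptions, not the contents of files/dump_filter.py: it works line by line instead of using an XML parser, it assumes the export puts `<page>` and `</page>` on their own lines (which also means literal tags cannot appear inside the escaped page text), and the names `filter_pages` and `filter_gzip` are hypothetical. The exclusion rule is passed in as a callable.

```python
import gzip
import html
import re

TITLE_RE = re.compile(rb"<title>(.*?)</title>")


def filter_pages(src, dst, is_excluded):
    """Copy src to dst line by line, skipping <page> blocks with excluded titles.

    src and dst are binary file objects; is_excluded takes a title string.
    Returns the number of pages dropped. Raises instead of guessing when the
    input does not look like the assumed layout, so a caller can fail closed.
    """
    dropped = 0
    page = None   # buffered lines of the current <page> block, if any
    title = None  # title of the current page once seen
    for line in src:
        if page is None:
            if line.strip() == b"<page>":
                page, title = [line], None
            else:
                dst.write(line)  # header, siteinfo, closing tag: pass through
            continue
        page.append(line)
        if title is None:
            match = TITLE_RE.search(line)
            if match:
                title = html.unescape(match.group(1).decode("utf-8"))
        if line.strip() == b"</page>":
            if title is None:
                raise ValueError("page block without a <title>")
            if is_excluded(title):
                dropped += 1
            else:
                dst.writelines(page)  # kept pages are copied byte for byte
            page = None
    if page is not None:
        raise ValueError("dump ended inside a <page> block")
    return dropped


def filter_gzip(raw_path, out_path, is_excluded):
    """Apply filter_pages to a gzipped dump, writing a new gzipped file."""
    with gzip.open(raw_path, "rb") as src, gzip.open(out_path, "wb") as dst:
        return filter_pages(src, dst, is_excluded)
```

Because kept lines are copied unmodified, this variant never re-serializes XML, so namespace prefixes cannot change. The trade-off is that it buffers one whole page in memory and depends on the line layout assumption above.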